An overview of the dataset will make the later operations easier to follow.
dim(USvideos)
[1] 40949 16
str(USvideos)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 40949 obs. of 16 variables:
$ video_id : chr "2kyS6SvSYSE" "1ZAPwfrtAFY" "5qpjK5DgCt4" "puqaWrEC7tY" ...
$ trending_date : chr "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
$ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
$ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
$ category_id : num 22 24 23 24 24 28 24 28 1 25 ...
$ publish_time : POSIXct, format: "2017-11-13 17:13:01" "2017-11-13 07:30:00" "2017-11-12 19:05:24" "2017-11-13 11:00:04" ...
$ tags : chr "SHANtell martin" "last week tonight trump presidency\"|\"last week tonight donald trump\"|\"john oliver trump\"|\"donald trump" "racist superman\"|\"rudy\"|\"mancuso\"|\"king\"|\"bach\"|\"racist\"|\"superman\"|\"love\"|\"rudy mancuso poo be"| __truncated__ "rhett and link\"|\"gmm\"|\"good mythical morning\"|\"rhett and link good mythical morning\"|\"good mythical mor"| __truncated__ ...
$ views : num 748374 2418783 3191434 343168 2095731 ...
$ likes : num 57527 97185 146033 10172 132235 ...
$ dislikes : num 2966 6146 5339 666 1989 ...
$ comment_count : num 15954 12703 8181 2146 17518 ...
$ thumbnail_link : chr "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg" "https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg" "https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg" "https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg" ...
$ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ description : chr "SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed t"| __truncated__ "One year after the presidential election, John Oliver discusses what we've learned so far and enlists our cathe"| __truncated__ "WATCH MY PREVIOUS VIDEO ▶ \\n\\nSUBSCRIBE ► https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confir"| __truncated__ "Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\\nDon't miss an all"| __truncated__ ...
- attr(*, "problems")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1533544 obs. of 5 variables:
..$ row : int 2 2 2 2 2 2 3 3 3 3 ...
..$ col : chr "tags" "tags" "tags" "tags" ...
..$ expected: chr "delimiter or quote" "delimiter or quote" "delimiter or quote" "delimiter or quote" ...
..$ actual : chr "|" "l" "|" "j" ...
..$ file : chr "'data/USvideos.csv'" "'data/USvideos.csv'" "'data/USvideos.csv'" "'data/USvideos.csv'" ...
- attr(*, "spec")=
.. cols(
.. video_id = col_character(),
.. trending_date = col_character(),
.. title = col_character(),
.. channel_title = col_character(),
.. category_id = col_double(),
.. publish_time = col_datetime(format = ""),
.. tags = col_character(),
.. views = col_double(),
.. likes = col_double(),
.. dislikes = col_double(),
.. comment_count = col_double(),
.. thumbnail_link = col_character(),
.. comments_disabled = col_logical(),
.. ratings_disabled = col_logical(),
.. video_error_or_removed = col_logical(),
.. description = col_character()
.. )
Now we need to check whether the dataset contains any outliers or mistakes.
First, test the “category_id” column. There are 43 categories, so its values should lie between 1 and 43.
assert(data = USvideos, in_set(1, 43, allow.na = FALSE), category_id)
Column 'category_id' violates assertion 'in_set(1, 43, allow.na = FALSE)' 38547 times
[omitted 38542 rows]
Error: assertr stopped execution
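Note that in_set(1, 43) tests membership in the two-element set {1, 43}, not the range from 1 to 43. The 38547 reported violations therefore include every value other than 1 and 43 (such as 22 or 24), not only out-of-range or missing ones, and this output alone does not tell us how many rows have NA in this column. A range check would use within_bounds(1, 43, allow.na = FALSE) or in_set(1:43, allow.na = FALSE). Any rows with NA in this column can simply be removed later.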
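The difference between a set-membership test and a range test can be sketched in base R. The vector below is made up for illustration and is not taken from USvideos; the two logical expressions only approximate what the assertr predicates test.

```r
# Hypothetical category IDs, including a missing and an out-of-range value
category_id <- c(1, 22, 24, 43, NA, 50)

# Membership in the two-element set {1, 43}: roughly what in_set(1, 43) tests
is_member <- category_id %in% c(1, 43)

# Range check 1..43 with NA treated as a failure: roughly what
# within_bounds(1, 43, allow.na = FALSE) tests
in_range <- !is.na(category_id) & category_id >= 1 & category_id <= 43

sum(!is_member)  # values flagged by the set test
sum(!in_range)   # values flagged by the range test
```

In this toy vector the set test flags the valid IDs 22 and 24 along with NA and 50, while the range test flags only NA and 50.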
The remaining numerical columns are counts, so in reality none of their values should be negative.
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), views)
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), likes)
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), dislikes)
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), comment_count)
Fortunately, none of the values are negative, so there are no mistakes here.
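As a cross-check, the same condition can be counted per column in base R. This is a minimal sketch on a made-up data frame (toy is hypothetical, not the real data); the lower bound of 0 is inclusive, so zeros pass.

```r
# Hypothetical stand-in for the four count columns
toy <- data.frame(
  views         = c(10, 0, 250),
  likes         = c(1, 0, 30),
  dislikes      = c(0, 0, 2),
  comment_count = c(3, 0, 12)
)

count_cols <- c("views", "likes", "dislikes", "comment_count")

# Number of negative and of missing values in each column
n_negative <- sapply(toy[count_cols], function(x) sum(x < 0, na.rm = TRUE))
n_missing  <- sapply(toy[count_cols], function(x) sum(is.na(x)))

n_negative
n_missing
```

Counting instead of asserting keeps the script running and shows how many values would need attention in each column.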
For the logical columns, every value should be TRUE or FALSE.
assert(data = USvideos, in_set(TRUE, FALSE, allow.na = FALSE), comments_disabled)
assert(data = USvideos, in_set(TRUE, FALSE, allow.na = FALSE), ratings_disabled)
assert(data = USvideos, in_set(TRUE, FALSE, allow.na = FALSE), video_error_or_removed)
There are no errors here either.
Because relatively few observations contain NA values (578 of 40949 rows, judging by the row counts before and after), we can simply remove every row that has an NA value.
USvideos_NNA <- as.data.frame(na.omit(USvideos))
USvideos_NNA
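Before dropping rows, it can help to see which columns hold the missing values and how many rows na.omit() removes. The sketch below uses a small made-up data frame (toy is hypothetical), not the real data.

```r
# Hypothetical data frame with missing values in two different columns
toy <- data.frame(
  id          = c("a", "b", "c", "d"),
  views       = c(10, NA, 30, 40),
  description = c("x", "y", NA, "z"),
  stringsAsFactors = FALSE
)

# Missing values per column
na_per_column <- colSums(is.na(toy))

# na.omit() drops every row that has an NA in any column
toy_clean <- na.omit(toy)
n_dropped <- nrow(toy) - nrow(toy_clean)

na_per_column
n_dropped
```

Applied to USvideos, the same two steps would show whether the missing values are concentrated in a single column.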
Then we need to convert the “trending_date” column from character to Date, using ydm() from the “lubridate” package because the values are stored in year.day.month order.
USvideos_NNA <- USvideos_NNA %>%
  mutate(trending_date = ydm(trending_date))
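The year.day.month reading can be cross-checked in base R with an explicit format string. This is only a sketch: "17.14.11" comes from the data above, while the other two strings are made up for illustration.

```r
# "17.14.11" is from the dataset; the other two values are hypothetical
trending_raw <- c("17.14.11", "18.01.02", "not a date")

# %y = two-digit year, %d = day of month, %m = month number
trending_parsed <- as.Date(trending_raw, format = "%y.%d.%m")

format(trending_parsed, "%Y-%m-%d")

# Strings that do not match the format become NA, so counting NAs
# reveals parse failures
n_failed <- sum(is.na(trending_parsed))
n_failed
```

Counting NA values after the conversion is a quick way to confirm that every trending date was parsed.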
Now let’s look at the structure of the dataset again.
str(USvideos_NNA)
'data.frame': 40371 obs. of 16 variables:
$ video_id : chr "2kyS6SvSYSE" "1ZAPwfrtAFY" "5qpjK5DgCt4" "puqaWrEC7tY" ...
$ trending_date : Date, format: "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14" ...
$ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
$ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
$ category_id : num 22 24 23 24 24 28 24 28 1 25 ...
$ publish_time : POSIXct, format: "2017-11-13 17:13:01" "2017-11-13 07:30:00" "2017-11-12 19:05:24" "2017-11-13 11:00:04" ...
$ tags : chr "SHANtell martin" "last week tonight trump presidency\"|\"last week tonight donald trump\"|\"john oliver trump\"|\"donald trump" "racist superman\"|\"rudy\"|\"mancuso\"|\"king\"|\"bach\"|\"racist\"|\"superman\"|\"love\"|\"rudy mancuso poo be"| __truncated__ "rhett and link\"|\"gmm\"|\"good mythical morning\"|\"rhett and link good mythical morning\"|\"good mythical mor"| __truncated__ ...
$ views : num 748374 2418783 3191434 343168 2095731 ...
$ likes : num 57527 97185 146033 10172 132235 ...
$ dislikes : num 2966 6146 5339 666 1989 ...
$ comment_count : num 15954 12703 8181 2146 17518 ...
$ thumbnail_link : chr "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg" "https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg" "https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg" "https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg" ...
$ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ description : chr "SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed t"| __truncated__ "One year after the presidential election, John Oliver discusses what we've learned so far and enlists our cathe"| __truncated__ "WATCH MY PREVIOUS VIDEO ▶ \\n\\nSUBSCRIBE ► https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confir"| __truncated__ "Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\\nDon't miss an all"| __truncated__ ...